About the project



Chapter 2: Regression and model validation

1. Loading data in

learning2014 <- read.csv("/Users/suvi/Documents/GitHub/IODS-project/Data/learning2014.csv")

dim(learning2014)
## [1] 166   8
str(learning2014)
## 'data.frame':    166 obs. of  8 variables:
##  $ X       : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ gender  : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
##  $ Age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ Attitude: int  37 31 25 35 37 38 35 29 38 21 ...
##  $ deep    : num  3.58 2.92 3.5 3.5 3.67 ...
##  $ stra    : num  3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num  2.58 3.17 2.25 2.25 2.83 ...
##  $ Points  : int  25 12 24 10 22 21 21 31 24 26 ...

The learning2014 dataset contains information about students, their attitudes and how they performed in an exam. There are 166 students in the data, and the variables are gender, age, attitude, deep, stra, surf and points. Gender and age are basic background information. Attitude measures the student's attitude towards statistics (combined from several questions). The variables deep, stra and surf are each combined from several questions and reported as Likert-scale (1-5) averages. Points is the student's exam score.

2. Graphical overview of the data

A graphical overview helps to understand the data. Histograms are one way to see the distributions of single variables.

hist_age <- hist(learning2014$Age)

hist_attitude <- hist(learning2014$Attitude)

As we can see, most of the students are 20-25 years old. The most common attitude scores are 30-35.

Scatter plots are also a good way to explore the data. For example, we can show points and attitude in the same plot:

library(ggplot2)

plot1 <- ggplot(learning2014, aes(x = Attitude, y = Points))

plot2 <- plot1 + geom_point()

plot2

plot3 <- plot2 + geom_smooth(method = "lm")

plot3

It seems that students with a higher attitude score also tend to get slightly higher exam points (we will test this later).
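One quick way to put a number on this pattern, before fitting the regression, is a correlation test (a sketch using the data frame loaded above):

```r
# Pearson correlation between attitude and exam points,
# with a t-test of the null hypothesis that the correlation is zero
cor.test(learning2014$Attitude, learning2014$Points)
```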

Another good way to see the data across all variables is a scatter plot matrix (pairs, or for example ggpairs from the GGally package).

pairs(learning2014[-1])

library(GGally)

p <- ggpairs(learning2014, mapping = aes(), lower = list(combo = wrap("facethist", bins = 20)))

p

3. Choosing three explanatory variables -> regression analysis between attitude and points

Simple regression analysis: attitude and points

keep_columns <- c("Age", "Attitude", "deep", "Points")

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:GGally':
## 
##     nasa
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
variables <- select(learning2014, one_of(keep_columns))

qplot(Attitude, Points, data = variables) + geom_smooth(method = "lm")

linear_model <- lm(Points ~ Attitude, data = variables)

summary(linear_model)
## 
## Call:
## lm(formula = Points ~ Attitude, data = variables)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 11.63715    1.83035   6.358 1.95e-09 ***
## Attitude     0.35255    0.05674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09

4. Summary of the models (simple and multiple models)

summary(linear_model)

There is a statistically significant relationship between points and attitude, because the p-value (4.119e-09) is smaller than 0.05. This suggests that students with a higher attitude score are more likely to get good exam points.
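The p-value conclusion can be cross-checked with confidence intervals (a sketch using the model fitted above):

```r
# 95% confidence intervals for the coefficients; an interval for
# Attitude that does not contain zero agrees with the small p-value
confint(linear_model, level = 0.95)
```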

Multiple regression analysis

linear_model2 <- lm(Points ~ Attitude + Age + deep, data = variables)

summary(linear_model2)
## 
## Call:
## lm(formula = Points ~ Attitude + Age + deep, data = variables)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -17.206  -3.430   0.204   3.979  10.950 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 15.60773    3.38966   4.605 8.32e-06 ***
## Attitude     0.35941    0.05696   6.310 2.56e-09 ***
## Age         -0.07716    0.05322  -1.450    0.149    
## deep        -0.60275    0.75031  -0.803    0.423    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.307 on 162 degrees of freedom
## Multiple R-squared:  0.2043, Adjusted R-squared:  0.1896 
## F-statistic: 13.87 on 3 and 162 DF,  p-value: 4.305e-08

There is a statistically significant relationship between points and attitude, but not between points and age or deep, as their p-values are above 0.05. In this model, age and the deep-learning score do not add much to the prediction of exam points; it is attitude that matters the most.

5. Plots

plot(linear_model)

plot(linear_model2)

These diagnostic plots show Residuals vs Fitted values, the Normal Q-Q plot, Scale-Location and Residuals vs Leverage, which can be used to check the assumptions of the linear models.
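All four diagnostic plots of the multiple model can also be shown in one view (a sketch; plot() on an lm object otherwise draws them one at a time):

```r
# 2x2 grid: Residuals vs Fitted, Normal Q-Q, Scale-Location,
# Residuals vs Leverage
par(mfrow = c(2, 2))
plot(linear_model2)
par(mfrow = c(1, 1))  # restore the default single-plot layout
```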



Chapter 3: Logistic regression

2. Here is the data


alc2 <- read.table("http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/alc.txt", sep=",", header=TRUE)
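The structure listing below was presumably produced with a call along these lines (the original chunk is not shown, so this is an assumption):

```r
str(alc2)
```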


## 'data.frame':    382 obs. of  35 variables:
##  $ school    : Factor w/ 2 levels "GP","MS": 1 1 1 1 1 1 1 1 1 1 ...
##  $ sex       : Factor w/ 2 levels "F","M": 1 1 1 1 1 2 2 1 2 2 ...
##  $ age       : int  18 17 15 15 16 16 16 17 15 15 ...
##  $ address   : Factor w/ 2 levels "R","U": 2 2 2 2 2 2 2 2 2 2 ...
##  $ famsize   : Factor w/ 2 levels "GT3","LE3": 1 1 2 1 1 2 2 1 2 1 ...
##  $ Pstatus   : Factor w/ 2 levels "A","T": 1 2 2 2 2 2 2 1 1 2 ...
##  $ Medu      : int  4 1 1 4 3 4 2 4 3 3 ...
##  $ Fedu      : int  4 1 1 2 3 3 2 4 2 4 ...
##  $ Mjob      : Factor w/ 5 levels "at_home","health",..: 1 1 1 2 3 4 3 3 4 3 ...
##  $ Fjob      : Factor w/ 5 levels "at_home","health",..: 5 3 3 4 3 3 3 5 3 3 ...
##  $ reason    : Factor w/ 4 levels "course","home",..: 1 1 3 2 2 4 2 2 2 2 ...
##  $ nursery   : Factor w/ 2 levels "no","yes": 2 1 2 2 2 2 2 2 2 2 ...
##  $ internet  : Factor w/ 2 levels "no","yes": 1 2 2 2 1 2 2 1 2 2 ...
##  $ guardian  : Factor w/ 3 levels "father","mother",..: 2 1 2 2 1 2 2 2 2 2 ...
##  $ traveltime: int  2 1 1 1 1 1 1 2 1 1 ...
##  $ studytime : int  2 2 2 3 2 2 2 2 2 2 ...
##  $ failures  : int  0 0 3 0 0 0 0 0 0 0 ...
##  $ schoolsup : Factor w/ 2 levels "no","yes": 2 1 2 1 1 1 1 2 1 1 ...
##  $ famsup    : Factor w/ 2 levels "no","yes": 1 2 1 2 2 2 1 2 2 2 ...
##  $ paid      : Factor w/ 2 levels "no","yes": 1 1 2 2 2 2 1 1 2 2 ...
##  $ activities: Factor w/ 2 levels "no","yes": 1 1 1 2 1 2 1 1 1 2 ...
##  $ higher    : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
##  $ romantic  : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
##  $ famrel    : int  4 5 4 3 4 5 4 4 4 5 ...
##  $ freetime  : int  3 3 3 2 3 4 4 1 2 5 ...
##  $ goout     : int  4 3 2 2 2 2 4 4 2 1 ...
##  $ Dalc      : int  1 1 2 1 1 1 1 1 1 1 ...
##  $ Walc      : int  1 1 3 1 2 2 1 1 1 1 ...
##  $ health    : int  3 3 3 5 5 5 3 1 1 5 ...
##  $ absences  : int  6 4 10 2 4 10 0 6 0 0 ...
##  $ G1        : int  5 5 7 15 6 15 12 6 16 14 ...
##  $ G2        : int  6 5 8 14 10 15 12 5 18 15 ...
##  $ G3        : int  6 6 10 15 10 15 11 6 19 15 ...
##  $ alc_use   : num  1 1 2.5 1 1.5 1.5 1 1 1 1 ...
##  $ high_use  : logi  FALSE FALSE TRUE FALSE FALSE FALSE ...

  • There are 382 observations and 35 variables in the dataset. The data describes student achievement in secondary education at two Portuguese schools.
  • The data includes the students' grades and demographic, social and school-related features.
  • The data was collected using school reports and questionnaires.

3. How do I assume the chosen variables relate to alcohol consumption?

From the data I chose sex, age, Pstatus (parents' cohabitation status) and famsup (family educational support). Here are my own hypotheses about their relationship with alcohol consumption.

  • I assume that there is not much correlation between sex and alcohol consumption.
  • I assume that age correlates somewhat with alcohol consumption.
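The age summary printed below was presumably produced with a call along these lines (the original chunk is not shown, so this is an assumption):

```r
summary(alc2$age)
```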

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   15.00   16.00   17.00   16.59   17.00   22.00

The students' ages range from 15 to 22, and I guess that alcohol consumption slowly rises with age.

  • I assume that Pstatus correlates a little with alcohol consumption. The students are young, and their parents' cohabitation status may have an impact on their behavior.
  • I believe that the strongest correlation with alcohol consumption is with famsup. My hypothesis is that if the family supports the student, alcohol consumption is lower.

4. How do the four chosen variables actually relate to alcohol consumption?

Below, the relationships of sex, age, Pstatus and famsup with alcohol consumption are examined numerically and graphically.


library(dplyr)
keep_columns <- c("sex","age","Pstatus", "famsup", "alc_use")
alc3 <- select(alc2, one_of(keep_columns))

Sex

library(gmodels)
library(ggplot2)
sex <- CrossTable(alc3$sex, alc3$alc_use)

It looks like alcohol use is higher among males than females. For females, answers in the range 1-2.5 are more common, while for males the distribution is more spread out. It looks like my hypothesis was wrong.
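The cross table can be condensed into group proportions (a sketch using the alc3 data frame defined above; alc_use > 2 matches the dataset's definition of high_use):

```r
# Row-wise shares of high alcohol use (alc_use > 2) for each sex
prop.table(table(alc3$sex, alc3$alc_use > 2), margin = 1)
```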


Age

Box plots visualize the 25th, 50th and 75th percentiles (the box), the typical range (the whiskers) and the outliers of a variable. At ages 17-18, alcohol use is more common than in the other age groups.
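The box plot described here could be drawn along these lines (a sketch; the original chunk is not shown):

```r
library(ggplot2)
# Alcohol use by age, one box per age group
ggplot(alc3, aes(x = factor(age), y = alc_use)) +
  geom_boxplot()
```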


Parents' cohabitation status (Pstatus)

Pstatus <- CrossTable(alc3$Pstatus, alc3$alc_use)
Pstatus

There is not much difference between students whose parents are divorced and those whose parents are still together. Note also that there are far fewer students with divorced parents than students whose parents live together. My hypothesis was wrong.



Family support (famsup)

famsup <- CrossTable(alc3$famsup, alc3$alc_use)
famsup

Family support seems to reduce alcohol consumption. In both groups (famsup no/yes) the largest category is alc_use = 1, but at higher levels of use (around alc_use = 3.5) the group of students without family support becomes the larger one. My hypothesis wasn't exactly right.


5. Logistic regression
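The output below was produced by a model of the following form (the call is echoed in the output; the odds-ratio step is my own addition):

```r
# Logistic regression of high alcohol use on the four chosen variables;
# the "- 1" drops the intercept so both sex levels get their own coefficient
model <- glm(high_use ~ sex + age + Pstatus + famsup - 1,
             data = alc2, family = "binomial")
summary(model)

# The coefficients are log-odds; exponentiating turns them into
# odds ratios, which are easier to interpret
exp(coef(model))

# Sequential analysis of deviance, as at the end of the output
anova(model, test = "Chisq")
```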

## 
## Call:  glm(formula = high_use ~ sex + age + Pstatus + famsup - 1, family = "binomial", 
##     data = alc2)
## 
## Coefficients:
##      sexF       sexM        age   PstatusT  famsupyes  
## -4.650529  -3.752118   0.209168  -0.204081   0.001453  
## 
## Degrees of Freedom: 382 Total (i.e. Null);  377 Residual
## Null Deviance:       529.6 
## Residual Deviance: 442.6     AIC: 452.6
##         sexF         sexM          age     PstatusT    famsupyes 
## -4.650529045 -3.752117717  0.209167575 -0.204081256  0.001453275
## 
## Call:
## glm(formula = high_use ~ sex + age + Pstatus + famsup - 1, family = "binomial", 
##     data = alc2)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.3652  -0.8553  -0.6947   1.2596   1.9402  
## 
## Coefficients:
##            Estimate Std. Error z value Pr(>|z|)   
## sexF      -4.650529   1.705347  -2.727  0.00639 **
## sexM      -3.752118   1.684643  -2.227  0.02593 * 
## age        0.209168   0.098717   2.119  0.03410 * 
## PstatusT  -0.204081   0.380699  -0.536  0.59191   
## famsupyes  0.001453   0.240731   0.006  0.99518   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 529.56  on 382  degrees of freedom
## Residual deviance: 442.58  on 377  degrees of freedom
## AIC: 452.58
## 
## Number of Fisher Scoring iterations: 4
## Analysis of Deviance Table
## 
## Model: binomial, link: logit
## 
## Response: high_use
## 
## Terms added sequentially (first to last)
## 
## 
##         Df Deviance Resid. Df Resid. Dev Pr(>Chi)    
## NULL                      382     529.56             
## sex      2   82.179       380     447.39  < 2e-16 ***
## age      1    4.526       379     442.86  0.03338 *  
## Pstatus  1    0.282       378     442.58  0.59522    
## famsup   1    0.000       377     442.58  0.99518    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


Chapter 4: Clustering and classification

2. Loading data in

library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
data("Boston")
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
dim(Boston)
## [1] 506  14

There are 506 observations and 14 variables in the Boston dataset. The data describes housing values in the suburbs of Boston and includes information about, for example, crime rates, rooms per dwelling and pupil-teacher ratio. The dataset is part of the MASS package, which can be loaded in R.

3. Data overview

summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00
hist_rm <- hist(Boston$rm)

The data can be explored through summary statistics or graphical methods. For example, the average number of rooms per dwelling can be shown as a histogram: the most common values are 5.5-7 rooms.

pairs(Boston)

cor_matrix <- cor(Boston)
cor_matrix
##                crim          zn       indus         chas         nox
## crim     1.00000000 -0.20046922  0.40658341 -0.055891582  0.42097171
## zn      -0.20046922  1.00000000 -0.53382819 -0.042696719 -0.51660371
## indus    0.40658341 -0.53382819  1.00000000  0.062938027  0.76365145
## chas    -0.05589158 -0.04269672  0.06293803  1.000000000  0.09120281
## nox      0.42097171 -0.51660371  0.76365145  0.091202807  1.00000000
## rm      -0.21924670  0.31199059 -0.39167585  0.091251225 -0.30218819
## age      0.35273425 -0.56953734  0.64477851  0.086517774  0.73147010
## dis     -0.37967009  0.66440822 -0.70802699 -0.099175780 -0.76923011
## rad      0.62550515 -0.31194783  0.59512927 -0.007368241  0.61144056
## tax      0.58276431 -0.31456332  0.72076018 -0.035586518  0.66802320
## ptratio  0.28994558 -0.39167855  0.38324756 -0.121515174  0.18893268
## black   -0.38506394  0.17552032 -0.35697654  0.048788485 -0.38005064
## lstat    0.45562148 -0.41299457  0.60379972 -0.053929298  0.59087892
## medv    -0.38830461  0.36044534 -0.48372516  0.175260177 -0.42732077
##                  rm         age         dis          rad         tax
## crim    -0.21924670  0.35273425 -0.37967009  0.625505145  0.58276431
## zn       0.31199059 -0.56953734  0.66440822 -0.311947826 -0.31456332
## indus   -0.39167585  0.64477851 -0.70802699  0.595129275  0.72076018
## chas     0.09125123  0.08651777 -0.09917578 -0.007368241 -0.03558652
## nox     -0.30218819  0.73147010 -0.76923011  0.611440563  0.66802320
## rm       1.00000000 -0.24026493  0.20524621 -0.209846668 -0.29204783
## age     -0.24026493  1.00000000 -0.74788054  0.456022452  0.50645559
## dis      0.20524621 -0.74788054  1.00000000 -0.494587930 -0.53443158
## rad     -0.20984667  0.45602245 -0.49458793  1.000000000  0.91022819
## tax     -0.29204783  0.50645559 -0.53443158  0.910228189  1.00000000
## ptratio -0.35550149  0.26151501 -0.23247054  0.464741179  0.46085304
## black    0.12806864 -0.27353398  0.29151167 -0.444412816 -0.44180801
## lstat   -0.61380827  0.60233853 -0.49699583  0.488676335  0.54399341
## medv     0.69535995 -0.37695457  0.24992873 -0.381626231 -0.46853593
##            ptratio       black      lstat       medv
## crim     0.2899456 -0.38506394  0.4556215 -0.3883046
## zn      -0.3916785  0.17552032 -0.4129946  0.3604453
## indus    0.3832476 -0.35697654  0.6037997 -0.4837252
## chas    -0.1215152  0.04878848 -0.0539293  0.1752602
## nox      0.1889327 -0.38005064  0.5908789 -0.4273208
## rm      -0.3555015  0.12806864 -0.6138083  0.6953599
## age      0.2615150 -0.27353398  0.6023385 -0.3769546
## dis     -0.2324705  0.29151167 -0.4969958  0.2499287
## rad      0.4647412 -0.44441282  0.4886763 -0.3816262
## tax      0.4608530 -0.44180801  0.5439934 -0.4685359
## ptratio  1.0000000 -0.17738330  0.3740443 -0.5077867
## black   -0.1773833  1.00000000 -0.3660869  0.3334608
## lstat    0.3740443 -0.36608690  1.0000000 -0.7376627
## medv    -0.5077867  0.33346082 -0.7376627  1.0000000
library(corrplot)
## corrplot 0.84 loaded
corrplot(cor_matrix)

If we explore the dataset using corrplot, we can see the correlations between the variables. There is a strong correlation, for example, between rm (average number of rooms per dwelling) and medv (median value of owner-occupied homes). On the other hand, there is hardly any correlation between chas (Charles River dummy variable) and the other variables, while nox (nitrogen oxides concentration) and dis (weighted mean of distances to five Boston employment centres) are strongly negatively correlated.

4. Scaling data and dividing data to training/testing sets

Let's scale the dataset and convert it to the data frame class. As the summary shows, the variables changed in the scaling process: each now has mean 0 and is measured in standard deviations.

Boston_scaled <- scale(Boston)
summary(Boston_scaled)
##       crim                 zn               indus        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202  
##       chas              nox                rm               age         
##  Min.   :-0.2723   Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331  
##  1st Qu.:-0.2723   1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366  
##  Median :-0.2723   Median :-0.1441   Median :-0.1084   Median : 0.3171  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.:-0.2723   3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059  
##  Max.   : 3.6648   Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164  
##       dis               rad               tax             ptratio       
##  Min.   :-1.2658   Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047  
##  1st Qu.:-0.8049   1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876  
##  Median :-0.2790   Median :-0.5225   Median :-0.4642   Median : 0.2746  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6617   3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058  
##  Max.   : 3.9566   Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372  
##      black             lstat              medv        
##  Min.   :-3.9033   Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.: 0.2049   1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median : 0.3808   Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.4332   3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 0.4406   Max.   : 3.5453   Max.   : 2.9865
class(Boston_scaled)
## [1] "matrix"
Boston_scaled2 <- as.data.frame(Boston_scaled)
summary(Boston_scaled2$crim)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -0.419367 -0.410563 -0.390280  0.000000  0.007389  9.924110
quantile_vector <- quantile(Boston_scaled2$crim)
quantile_vector
##           0%          25%          50%          75%         100% 
## -0.419366929 -0.410563278 -0.390280295  0.007389247  9.924109610
crime <- cut(Boston_scaled2$crim, breaks = quantile_vector, include.lowest = TRUE)

library(dplyr)

Boston_scaled2 <- dplyr::select(Boston_scaled2, -crim)
Boston_scaled2 <- data.frame(Boston_scaled2, crime)

Now the old continuous crim variable has been removed and a new categorical crime variable, based on the quantiles, has been added.
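A quick sanity check (a sketch) confirms that the quantile-based cut produced four classes of roughly equal size:

```r
table(Boston_scaled2$crime)
```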

Let's divide the data into a training set (80% of the data) and a test set (20% of the data).

n <- nrow(Boston_scaled2)
ind <- sample(n, size = n * 0.8)
train <- Boston_scaled2[ind, ]
test <- Boston_scaled2[-ind, ]

5. Linear discriminant analysis

Next I will use linear discriminant analysis (LDA) for classification. LDA finds the combinations of variables that best separate the classes of the target variable. Here the target variable is crime and the predictors are all the other variables.

lda.fit <- lda(crime ~ ., data=train)
lda.fit
## Call:
## lda(crime ~ ., data = train)
## 
## Prior probabilities of groups:
## [-0.419,-0.411]  (-0.411,-0.39] (-0.39,0.00739]  (0.00739,9.92] 
##       0.2524752       0.2648515       0.2425743       0.2400990 
## 
## Group means:
##                         zn      indus        chas        nox          rm
## [-0.419,-0.411]  0.9094195 -0.9114340 -0.11793298 -0.8609650  0.39675601
## (-0.411,-0.39]  -0.1013460 -0.2676973  0.02203357 -0.5739594 -0.11815477
## (-0.39,0.00739] -0.3892353  0.1651841  0.29011382  0.3833110  0.05425532
## (0.00739,9.92]  -0.4872402  1.0172187 -0.06938576  1.0439892 -0.50757997
##                        age        dis        rad        tax    ptratio
## [-0.419,-0.411] -0.8664858  0.9070544 -0.6778652 -0.7622799 -0.3476265
## (-0.411,-0.39]  -0.3545617  0.3743294 -0.5525377 -0.5056916 -0.0550060
## (-0.39,0.00739]  0.4455031 -0.3297457 -0.4252163 -0.3208426 -0.2094708
## (0.00739,9.92]   0.8069523 -0.8517270  1.6371072  1.5133254  0.7795879
##                      black       lstat         medv
## [-0.419,-0.411]  0.3760556 -0.75200613  0.485824019
## (-0.411,-0.39]   0.3156987 -0.16077581  0.004867159
## (-0.39,0.00739]  0.1040892  0.08284912  0.083971619
## (0.00739,9.92]  -0.7799415  0.90867380 -0.743826935
## 
## Coefficients of linear discriminants:
##                  LD1         LD2         LD3
## zn       0.069667625  0.71862761 -0.93785568
## indus   -0.003146326 -0.23115626  0.56121723
## chas    -0.094721708 -0.11143620  0.05617874
## nox      0.390211323 -0.73079804 -1.47870463
## rm      -0.197792394 -0.14063209 -0.16242921
## age      0.185306589 -0.36477457 -0.23413234
## dis     -0.058666910 -0.35777241  0.12026997
## rad      3.340857078  1.02151173  0.06824784
## tax      0.078506932 -0.13054062  0.38652823
## ptratio  0.150950974 -0.06681335 -0.36056273
## black   -0.143677370  0.02800018  0.11623144
## lstat    0.207940637 -0.27025935  0.29225064
## medv     0.248295445 -0.41198696 -0.23776283
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9529 0.0343 0.0128
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1,
                       color = "orange", tex = 0.75, choices = c(1, 2)) {
  heads <- coef(x)
  # draw an arrow from the origin for each LDA coefficient
  arrows(x0 = 0, y0 = 0,
         x1 = myscale * heads[, choices[1]],
         y1 = myscale * heads[, choices[2]],
         col = color, length = arrow_heads)
  text(myscale * heads[, choices], labels = row.names(heads),
       cex = tex, col = color, pos = 3)
}

classes <- as.numeric(train$crime)

plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 1)

6. Predicting classes

First I save the crime categories and then remove the crime variable from the test dataset.

crime_categories <- as.numeric(test$crime)

library(dplyr)

test_data <- dplyr::select(test, -crime)
test_data <- data.frame(test_data, crime_categories)

Next it's time to predict the classes in the test data using the LDA model.

lda.pred <- predict(lda.fit, newdata = test_data)

table(correct = crime_categories, predicted = lda.pred$class)
##        predicted
## correct [-0.419,-0.411] (-0.411,-0.39] (-0.39,0.00739] (0.00739,9.92]
##       1              15              9               1              0
##       2               2             11               6              0
##       3               0              6              20              2
##       4               0              0               0             30

As can be seen from the table, the classifier predicted the crime classes quite well, but not perfectly: most predictions fall on the diagonal.
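The visual impression can be summarized as a single accuracy figure (a sketch reusing the objects above):

```r
# Proportion of test observations on the diagonal of the cross table,
# i.e. the share of correctly predicted crime classes
tab <- table(correct = crime_categories, predicted = lda.pred$class)
sum(diag(tab)) / sum(tab)
```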

7. k-means algorithm

Let’s first reload the Boston dataset and standardize it.

library(MASS)
data("Boston")
Boston_stand <- scale(Boston)

Now that the data is scaled, the distances between observations are comparable.

dist_eu <- dist(Boston_stand)
summary(dist_eu)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1343  3.4625  4.8241  4.9111  6.1863 14.3970
dist_manhattan <- dist(Boston_stand, method = 'manhattan')
summary(dist_manhattan)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2662  8.4832 12.6090 13.5488 17.7568 48.8618

K-means is a clustering method; here I use it to assign the observations of the scaled Boston data to groups based on similarity.

km <- kmeans(Boston_stand, centers = 3)

pairs(Boston_stand, col = km$cluster)
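The choice of three clusters is arbitrary; a common heuristic is to look for a bend ("elbow") in the total within-cluster sum of squares as a function of the number of clusters (a sketch):

```r
set.seed(123)  # k-means starts from random centers
# Total WCSS for k = 1..10 clusters
twcss <- sapply(1:10, function(k) kmeans(Boston_stand, centers = k)$tot.withinss)
plot(1:10, twcss, type = "b", xlab = "number of clusters k", ylab = "total WCSS")
```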



Chapter 5: Dimensionality reduction techniques

Loading data in

library(dplyr)
setwd("/Users/suvi/Documents/GitHub/IODS-project")
human <- read.csv("/Users/suvi/Documents/GitHub/IODS-project/Data/human3.csv", header = TRUE, sep = ",")
rownames(human) <- human$X
keep <- c("EduRatio", "Labratio", "EduExp", "LifeExp", "GNI", "MMRatio", "ABirthR", "RepPar")
human <- dplyr::select(human, one_of(keep))

1. Data overview

summary(human)
##     EduRatio         Labratio          EduExp         LifeExp     
##  Min.   :0.1717   Min.   :0.1857   Min.   : 5.40   Min.   :49.00  
##  1st Qu.:0.7264   1st Qu.:0.5984   1st Qu.:11.25   1st Qu.:66.30  
##  Median :0.9375   Median :0.7535   Median :13.50   Median :74.20  
##  Mean   :0.8529   Mean   :0.7074   Mean   :13.18   Mean   :71.65  
##  3rd Qu.:0.9968   3rd Qu.:0.8535   3rd Qu.:15.20   3rd Qu.:77.25  
##  Max.   :1.4967   Max.   :1.0380   Max.   :20.20   Max.   :83.50  
##       GNI            MMRatio          ABirthR           RepPar     
##  Min.   :   581   Min.   :   1.0   Min.   :  0.60   Min.   : 0.00  
##  1st Qu.:  4198   1st Qu.:  11.5   1st Qu.: 12.65   1st Qu.:12.40  
##  Median : 12040   Median :  49.0   Median : 33.60   Median :19.30  
##  Mean   : 17628   Mean   : 149.1   Mean   : 47.16   Mean   :20.91  
##  3rd Qu.: 24512   3rd Qu.: 190.0   3rd Qu.: 71.95   3rd Qu.:27.95  
##  Max.   :123124   Max.   :1100.0   Max.   :204.80   Max.   :57.50

After data wrangling, the data has 155 observations and 8 variables. It provides an overview of key aspects of human development, combining indicators of education, health, economy and political participation.

library(GGally)
ggpairs(human)

cor(human) %>% corrplot

It looks like there is a strong positive correlation between LifeExp (life expectancy at birth) and EduExp (expected years of education), and a strong negative correlation between LifeExp and MMRatio (maternal mortality ratio), which makes sense.

2. Principal component analysis (PCA) for not standardized data

Here is a biplot of the PCA on the non-standardized data.

pca_human <- prcomp(human)

biplot(pca_human, choices = 1:2, cex = c(0.8, 1), col = c("grey40", "deeppink2"))
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped


3. Principal component analysis (PCA) for standardized data

First I standardize the dataset.

human_std <- scale(human)
summary(human_std)
##     EduRatio          Labratio           EduExp           LifeExp       
##  Min.   :-2.8189   Min.   :-2.6247   Min.   :-2.7378   Min.   :-2.7188  
##  1st Qu.:-0.5233   1st Qu.:-0.5484   1st Qu.:-0.6782   1st Qu.:-0.6425  
##  Median : 0.3503   Median : 0.2316   Median : 0.1140   Median : 0.3056  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5958   3rd Qu.: 0.7350   3rd Qu.: 0.7126   3rd Qu.: 0.6717  
##  Max.   : 2.6646   Max.   : 1.6632   Max.   : 2.4730   Max.   : 1.4218  
##       GNI             MMRatio           ABirthR            RepPar       
##  Min.   :-0.9193   Min.   :-0.6992   Min.   :-1.1325   Min.   :-1.8203  
##  1st Qu.:-0.7243   1st Qu.:-0.6496   1st Qu.:-0.8394   1st Qu.:-0.7409  
##  Median :-0.3013   Median :-0.4726   Median :-0.3298   Median :-0.1403  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.3712   3rd Qu.: 0.1932   3rd Qu.: 0.6030   3rd Qu.: 0.6127  
##  Max.   : 5.6890   Max.   : 4.4899   Max.   : 3.8344   Max.   : 3.1850

And here is a biplot of the PCA on the standardized dataset.

pca_human_std <- prcomp(human_std)

biplot(pca_human_std, choices = 1:2, cex = c(0.8, 1), col = c("grey40", "deeppink2"))

As we can see, the biplots look different: PC1 of the non-standardized data is dominated by GNI, whose scale is far larger than that of the other variables, while PC1 of the standardized data combines several variables. Both biplots display the observations along the first two principal components.

4. My personal interpretation of the first two principal components

A biplot is a plot that represents both the observations and the variables of a multivariate data matrix in the same display.

PCA transforms the coordinate system, and the new axes are the principal components. The origin of the new coordinate system is located at the center of the data points. The first PC points in the direction of the highest variance, and the second in the direction of the second highest.
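This ordering can be verified directly: the variances of the PC scores decrease from PC1 onward and equal the squared standard deviations that `prcomp` stores in `$sdev`. A minimal sketch on simulated data:

```r
# Sketch: variances of the PC scores are decreasing, and they match
# pca$sdev^2 (the eigenvalues of the covariance matrix). Toy data only.
set.seed(3)
x <- matrix(rnorm(200), ncol = 4)
pca <- prcomp(x)

apply(pca$x, 2, var)   # decreasing variances of the scores
pca$sdev^2             # identical to the line above
```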

As can be seen from the biplots above, there is a clear difference. If the data are not standardized, variables measured on different scales do not contribute equally to the analysis.
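The effect of the scales is easy to demonstrate on simulated data (not the human dataset): when one variable has a much larger variance, it dominates PC1 almost completely unless the data are standardized first.

```r
# Sketch: without scaling, the large-variance variable captures nearly
# all of PC1; with scale. = TRUE both variables contribute equally.
set.seed(2)
toy <- data.frame(small = rnorm(100, sd = 1),
                  large = rnorm(100, sd = 100))
pca_raw <- prcomp(toy)               # no standardization
pca_std <- prcomp(toy, scale. = TRUE)

summary(pca_raw)$importance["Proportion of Variance", ]  # PC1 close to 1
summary(pca_std)$importance["Proportion of Variance", ]  # roughly 0.5 and 0.5
```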

If the arrows are close to each other, the variables correlate. As already noticed, there is, for example, a strong correlation between LifeExp and EduExp, but not much correlation between LifeExp and MMRatio.
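The arrows-and-angles reading can be checked numerically. In this sketch on simulated data (again a toy stand-in, not the human data), the loading arrows of two strongly correlated variables point in nearly the same direction on the first two PCs, while the arrow of an independent variable is roughly orthogonal to them.

```r
# Sketch: angle between loading arrows reflects correlation
# (small angle ~ strong correlation, ~90 degrees ~ none).
set.seed(4)
a <- rnorm(500)
b <- 0.9 * a + sqrt(1 - 0.9^2) * rnorm(500)   # strongly correlated with a
c <- rnorm(500)                               # independent of a
pca <- prcomp(data.frame(a, b, c), scale. = TRUE)
arrows2d <- pca$rotation[, 1:2]               # arrow directions on PC1/PC2

angle <- function(u, v)
  acos(sum(u * v) / (sqrt(sum(u^2)) * sqrt(sum(v^2)))) * 180 / pi

angle(arrows2d["a", ], arrows2d["b", ])  # small angle: strong correlation
angle(arrows2d["a", ], arrows2d["c", ])  # near 90 degrees: little correlation
```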

5. Multiple Correspondence Analysis (MCA)

library(FactoMineR)
data("tea")

str(tea)
## 'data.frame':    300 obs. of  36 variables:
##  $ breakfast       : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
##  $ tea.time        : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
##  $ evening         : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
##  $ lunch           : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dinner          : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
##  $ always          : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
##  $ home            : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
##  $ work            : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
##  $ tearoom         : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
##  $ friends         : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
##  $ resto           : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
##  $ pub             : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Tea             : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ How             : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ sugar           : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ how             : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ where           : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ price           : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
##  $ age             : int  39 45 47 23 48 21 37 36 40 37 ...
##  $ sex             : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
##  $ SPC             : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
##  $ Sport           : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
##  $ age_Q           : Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...
##  $ frequency       : Factor w/ 4 levels "1/day","1 to 2/week",..: 1 1 3 1 3 1 4 2 3 3 ...
##  $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
##  $ spirituality    : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
##  $ healthy         : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
##  $ diuretic        : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
##  $ friendliness    : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
##  $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ feminine        : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
##  $ sophisticated   : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
##  $ slimming        : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ exciting        : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
##  $ relaxing        : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
##  $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
dim(tea)
## [1] 300  36

I reduce the number of variables to six by creating a new dataset, tea2.

library(dplyr)
keep_columns <- c("breakfast", "tea.time", "evening", "lunch", "dinner", "always")
tea2 <- dplyr::select(tea, one_of(keep_columns))
str(tea2)
## 'data.frame':    300 obs. of  6 variables:
##  $ breakfast: Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
##  $ tea.time : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
##  $ evening  : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
##  $ lunch    : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dinner   : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
##  $ always   : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
dim(tea2)
## [1] 300   6

And here is the data visualized.

library(tidyr)

gather(tea2) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar() + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
## Warning: attributes are not identical across measure variables;
## they will be dropped

tea2_mca <- MCA(tea2, graph = FALSE)
summary(tea2_mca)
## 
## Call:
## MCA(X = tea2, graph = FALSE) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6
## Variance               0.234   0.196   0.177   0.146   0.131   0.116
## % of var.             23.426  19.586  17.722  14.643  13.056  11.566
## Cumulative % of var.  23.426  43.012  60.734  75.378  88.434 100.000
## 
## Individuals (the 10 first)
##                  Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3
## 1             |  0.189  0.051  0.058 | -0.479  0.390  0.375 | -0.245
## 2             |  0.189  0.051  0.058 | -0.479  0.390  0.375 | -0.245
## 3             |  0.744  0.787  0.189 |  0.137  0.032  0.006 |  1.201
## 4             |  1.386  2.732  0.689 | -0.168  0.048  0.010 |  0.674
## 5             | -0.083  0.010  0.006 |  0.493  0.413  0.226 | -0.418
## 6             |  1.386  2.732  0.689 | -0.168  0.048  0.010 |  0.674
## 7             | -0.229  0.074  0.099 | -0.657  0.735  0.822 | -0.085
## 8             | -0.152  0.033  0.032 |  0.257  0.112  0.090 |  0.484
## 9             | -0.229  0.074  0.099 | -0.657  0.735  0.822 | -0.085
## 10            | -0.036  0.002  0.002 |  0.005  0.000  0.000 |  0.123
##                  ctr   cos2  
## 1              0.113  0.098 |
## 2              0.113  0.098 |
## 3              2.712  0.492 |
## 4              0.853  0.163 |
## 5              0.329  0.163 |
## 6              0.853  0.163 |
## 7              0.013  0.014 |
## 8              0.440  0.320 |
## 9              0.013  0.014 |
## 10             0.028  0.018 |
## 
## Categories (the 10 first)
##                   Dim.1     ctr    cos2  v.test     Dim.2     ctr    cos2
## breakfast     |  -0.455   7.063   0.191  -7.556 |  -0.594  14.424   0.326
## Not.breakfast |   0.420   6.520   0.191   7.556 |   0.549  13.314   0.326
## Not.tea time  |   0.683  14.474   0.361  10.391 |   0.267   2.645   0.055
## tea time      |  -0.529  11.219   0.361 -10.391 |  -0.207   2.051   0.055
## evening       |  -0.429   4.487   0.096  -5.359 |   0.843  20.776   0.372
## Not.evening   |   0.224   2.346   0.096   5.359 |  -0.441  10.863   0.372
## lunch         |  -1.349  18.987   0.313  -9.670 |   0.467   2.726   0.038
## Not.lunch     |   0.232   3.263   0.313   9.670 |  -0.080   0.469   0.038
## dinner        |   2.420  29.155   0.441  11.478 |  -0.296   0.522   0.007
## Not.dinner    |  -0.182   2.194   0.441 -11.478 |   0.022   0.039   0.007
##                v.test     Dim.3     ctr    cos2  v.test  
## breakfast      -9.872 |  -0.264   3.144   0.064  -4.385 |
## Not.breakfast   9.872 |   0.244   2.902   0.064   4.385 |
## Not.tea time    4.062 |  -0.228   2.132   0.040  -3.469 |
## tea time       -4.062 |   0.177   1.652   0.040   3.469 |
## evening        10.544 |   0.609  11.977   0.194   7.615 |
## Not.evening   -10.544 |  -0.318   6.262   0.194  -7.615 |
## lunch           3.351 |   0.872  10.490   0.131   6.252 |
## Not.lunch      -3.351 |  -0.150   1.803   0.131  -6.252 |
## dinner         -1.404 |   1.685  18.689   0.214   7.993 |
## Not.dinner      1.404 |  -0.127   1.407   0.214  -7.993 |
## 
## Categorical variables (eta2)
##                 Dim.1 Dim.2 Dim.3  
## breakfast     | 0.191 0.326 0.064 |
## tea.time      | 0.361 0.055 0.040 |
## evening       | 0.096 0.372 0.194 |
## lunch         | 0.313 0.038 0.131 |
## dinner        | 0.441 0.007 0.214 |
## always        | 0.004 0.378 0.420 |
plot(tea2_mca, invisible=c("ind"), habillage = "quali")

As can be seen from the plot, the variable categories are drawn on the first two dimensions. The distance between variable categories gives a measure of their similarity: evening and always look quite similar, while dinner differs the most from the other variables.
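When interpreting the plot it is worth remembering how much variation the two drawn dimensions actually capture. The eigenvalue table from the summary above shows Dim.1 and Dim.2 together explain only about 43 % of the variance; they can also be read directly from the MCA object, as in this sketch (repeating the same MCA as above):

```r
# Sketch: inspect the eigenvalues of the MCA result directly.
# Same tea2 subset and MCA call as in the analysis above.
library(FactoMineR)
data("tea")
tea2 <- tea[, c("breakfast", "tea.time", "evening", "lunch", "dinner", "always")]
tea2_mca <- MCA(tea2, graph = FALSE)

tea2_mca$eig   # eigenvalue, % of variance, cumulative % for each dimension
```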